Shareware Grab Bag

home *** CD-ROM | disk | FTP | other *** search

/ Shareware Grab Bag / Shareware Grab Bag.iso / 014 / qsort3.arc / QSORT.DOC < prev next >

Wrap

Text File | 1987-03-12 | 26KB | 524 lines

QSORT - Version 3.00 Copyright (c) 1985, 1986, 1987 - Ben Baker - All rights reserved QSORT may be freely copied and distributed. provided that 1) it is distributed under the name "QSORT," and 2) this documentation file always accompanies it. Vendors wishing to distribute QSORT commercially, or with commercial products may contact the author at the address below for terms. QSORT is made available under the "share-ware" concept. If you find this program useful, you are asked to send its author a license fee of $20 for each machine on which you use it. This will encourage the development of other useful, affordable tools. Send checks to: Ben Baker RR #1, Box 637 E. Alton, Il 62024 Bug reports or suggestions may be sent to the above address or via FidoNet mail to Fido node # 100/76, Baker's Acre, or by logging into Baker's Acre BBS at 618-251-2169. --- INTRODUCTION --- QSORT was first designed to be a replacement for, and to overcome the limitations of DOS SORT, but has been enhanced a number of times and moved to new compilers twice. The current version will sort files whose size is limited only by available disk space. File name(s) may be given explicitly or QSORT will sort from standard input to standard output, and so, may be used in pipes or with redirection. Multiple key fields may be specified. Binary files with fixed-length records may be sorted, provided only that keys are ASCII fields. QSORT tries to be very protective of your data. If QSORT has an error of any kind, it will terminate with the input file still intact, and will return to DOS with ERRORLEVEL = 1. When QSORT successfully completes a sort, it terminates with ERRORLEVEL set to 0. (See the batch commands in your DOS manual for more information on ERRORLEVEL.) The command line syntax is a super-set of SORT's, so QSORT may be used without other changes in batch files using SORT, but in most cases you will probably want to make use of QSORT's greater capabilities. --- The QSORT Command and Options --- QSORT is invoked with the following command: QSORT [<in_file> [<out_file>]] [/<key_spec>. . .] [/T[<tag>] | /F<len>] [/R] [/S] [/?] Note that all parameters on the command line are optional. The <in_file> and <out_file> parameters are "ASCII-Z" file specifiers. They may contain disk and path information in the standard DOS format, but must not contain "wild-card" characters. If <in_file> is missing, QSORT sorts from standard input to standard output. These are files defined and opened by DOS before QSORT is loaded. (See your DOS manual concerning the use of redirection and pipes.) If <in_file> is given but <out_file> is missing, QSORT creates a temporary file in the directory containing <in_file> and sorts to the temporary file. On successful completion of the sort, <in_file> is deleted and the temporary is renamed to <in_file>. The effect is an apparent "sort-in-place." If both file names are given, <in_file> is unchanged and the sorted output is written to <out_file>. This feature was added in version 3.00 as a convenience. New DOS users sometimes have difficulty with the concept of redirection. Note that the following two commands are exactly equivalent: QSORT FILE.TXT FILE.SRT QSORT <FILE.TXT >FILE.SRT In the first, QSORT opens the files. In the second, redirection is specified and DOS opens the files. The result is the same. It is an error QSORT can't detect if you mix these. For instance: QSORT FILE.TXT >FILE.SRT will result in a sort-in-place. QSORT will open FILE.TXT but won't know DOS has opened FILE.SRT for it, and will ignore it. The /<key_spec> Parameter Up to 30 /<key_spec> parameters may be used to specify key fields and are ordered major to minor from left to right. The /<key_spec> argument has the form: /[L][+|-][col][:width] Note that all elements of this argument are "optional," but at least one element must be present following the slant-bar (/). The 'L', if present, specifies "lexical" sequence for this field. (See discussion of lexical sequence below.) The minus (-) sign specifies reverse order for this field. The plus (+) sign (or no sign) specifies normal sort order. If present, [col] defines the beginning column of the field. If omitted, column 1 is assumed. If present, [:width] defines the field width in columns. If width is omitted, the rest of the record is assumed to be part of the key. If no key field parameters are given, the entire record is the key field. When sorting variable-length records, any key field which falls beyond the end of a particular record is treated as a null (zero length) field for that record, and collates low. When sorting fixed-length records, all defined fields must fall within the defined record length. The /F<len> Parameter The /F<len> parameter defines the record length for a file of fixed-length records. All records in the input file MUST be exactly <len> bytes long. The records need not (but may) be terminated with a CR/LF sequence. They may contain any data, even binary data, but the key fields must be ASCII strings. Strings may be terminated with a null (binary zero) character, or may be padded with trailing spaces to the full length of the field. Note that QSORT does not attempt to support Pascal style strings. These are strings which begin with a character whose binary value is a character count. This is followed by <count> characters of ASCII data, which in turn is followed by random data out to the maximum length of the string. These strings may be used as key fields, but the programmer must insure that either the last real character is a null character, or the field is padded to its full length with spaces. QSORT must be told that the key begins in the second character position (the first character of real data). The /T[<tag>] Parameter The /T[<tag>] parameter, if present, indicates that the "records" to be sorted may be more than a single line long. If <tag> is also present, it defines a character to be used to tag the "end-of-record." If <tag> is not present, the first empty line terminates the record. For this purpose, "empty" means "no characters." A line containing but a single space is NOT empty! A line may be "tagged" by placing the <tag> character anywhere on the last line of a logical record. The entire line, including the tag character will appear as the last line of the record. Note that the /F<len> and /T[<tag>] parameters are incompatible, and may not both be specified. The /R Parameter The /R parameter is included for compatibility with DOS SORT and is redundant. It reverses the sense of sort direction for all key fields. The /S Parameter The /S parameter tells QSORT to make a statistics report to the screen at the end of a run. The report is written to the "standard error" device, the console, and may not be redirected. The following is an actual statistics report "cut" from the screen after QSORT had sorted a 1 mega-byte+ file: QSORT - Text Sort - Version 3.00 Copyright (c) 1985, 1986, 1987 - Ben Baker - All rights reserved 10131 records sorted 146 bytes in longest record 116744 sort phase comparisons 53771 merge phase comparisons 170515 total comparisons 16.8 comparisons per input record 21 temporary merge files created 2 merge passes 2.3 average passes over data 6:00 elapsed time The first two numbers are self-explanatory. The next two are the number of times two records were compared during the sort phase and the merge phase respectively, followed by the total comparisons. The next number is total comparisons divided by the number of records in the input file. If there is no merge phase, this number is typically 10 to 12. If the file is large enough to require merging, it is 12 to 20, on average. If it is much larger than 20, it usually means that there is something unusual about your input file. It may already be sorted, or there may be large blocks of records which compare equal. This can happen if you sort on, say column 50 and the input file contains a large number of records shorter than 50 bytes. In this case, a minor sort key at column 1 may significantly speed sorting. The next two items are self-explanatory. "Average passes over data" reflects the number of times each record was read and written. For short files not requiring a merge pass, this number will be 1.0. When merging is needed, the last merge pass is the one which writes the output file and it must read and write every record exactly once. Thus when only one merge pass is made, there will be exactly 2.0 "average passes over data." In the above case the first merge pass processed about 30% of the records, hence the value of 2.3. The above sort was performed on an IBM XT with two hard drives (C and D). The input file was on C; the temporary merge files were placed on D; and the output file was written to C. The sort of a 1 mega-byte file took six minutes flat. With my Tiny Turbo accelerator turned on, the same sort takes about 3:50. The /? Parameter The /? parameter requests help or parameter evaluation. When QSORT is executed with the /? parameter alone, it lists a short description of the QSORT parameters. If /? is entered as one of several parameters, QSORT will produce a short report on the screen describing the sort it would perform based on those parameters without actually doing a sort. For example: QSORT /? /L5:12 /-3:2 /22 /T /R <INFILE.TXT >OUTFILE.TXT requests a sort on a file of tagged records. It defines three key fields, one of them inverted (-). The /R parameter reverses the sense of all three fields. Since redirection is specified, QSORT does not see or know about the names of the files it will sort. The /? parameter requests a report, rather than a sort, and QSORT obligingly produces on the screen, the following: QSORT - Text Sort - Version 3.00 Copyright (c) 1985, 1986, 1987 - Ben Baker - All rights reserved With the present arguments, QSORT would sort from STDIN to STDOUT Records are multiple lines ending with an empty line Key fields in descending order of importance are: Pos Len Type 5 12 Lexical Descending 3 2 ASCII 22 65535 ASCII Descending The third field has an unspecified length. The value "65535" merely means that this field extends to the end of each record. The /M<len> supported in earlier versions of QSORT is no longer required, but will be accepted (and ignored) by QSORT. There is no "hard-coded" maximum record length in QSORT, but there is a practical limit. At some time during every sort, the two longest records in the input file must be compared. Therefore, the two longest records must be able to fit together in the sort buffer. The sum of their lengths cannot exceed about 50K -- not an altogether unreasonable limitation. QSORT can be shoe-horned into tighter memory and will run if it can find 4K for a sort buffer and 4K for an output buffer, but the two longest records must still fit in the sort buffer together. Arguments may appear in any order on the command line except that <in_file> must appear before <out_file>, and /<key_spec> arguments must appear in descending order of importance. --- Lexical Sorting --- The lexical sorting capability was borne out of my own need to sort word lists with mixed capitalization. ASCII sequence produced some bizarre results when words beginning with 'Z' sorted before those beginning with 'a.' Case-insensitive sorting wasn't much better because upper and lower case got mixed randomly. The following table will illustrate what I mean: INPUT ASCII CASE LEXICAL INSENSITIVE DeLaPort Baker Baker Baker Smith Brown brown Brown brown DeAngelo bRown bRown deLaPorte DeLaPort Brown brown Deangelo Deangelo Deangelo DeAngelo deAngelo Deangelo deangelo Deangelo Brown DelaPort Deangelo Deangelo smith DelaPorte deAngelo deAngelo delaPorte Harry DeAngelo deangelo DelaPort Smith delaPort DeLaPort DeAngelo bRown DelaPort DelaPort DelaPorte brown delaPort delaPort deangelo deAngelo DeLaPort delaPort Harry deLaPorte DelaPorte DelaPorte delaPort deLaPorte deLaPorte deLaPorte Baker deangelo delaPorte deLaPorte deLaPorte delaPort deLaPorte delaPorte Deangelo delaPort Harry Harry bRown delaPorte smith Smith delaPort smith Smith smith The first column is a list of names in arbitrary order. The second is an ASCII sort of that list. Third we have one possible case-insensitive sort of the list. The fourth column is what I really wanted. It is sorted the way these words would be sorted in a dictionary (or lexicon). The third and fourth columns both collect words of identical spellings together, but in the third column, upper and lower case spellings are in arbitrary order, while the fourth column places upper case spellings ahead of lower case spellings. For example, the two occurrences of Smith are widely separated in column 2 because one is capitalized and the other is not. Column 3 brings the two together, but in the wrong order. They might have been in the right order, but the order is strictly arbitrary. In column 4, Smith comes before smith, and lexical sorting will always put them in this order. Lexical sorting is achieved by making case-insensitive comparisions of entire fields. If the fields compare equal, an ASCII comparison is made to arbitrate ties. In other words, when "lexical" fields in two records have different spellings, the case-insensitive comparison determines the order of the records. When "lexical" fields are spelled the same, the case-sensitive comparison determines the order of the records. Lexical fields are defined, as indicated above, by placing the letter 'L' immediately following the slant-bar (/) in <key_spec> definitions. Lexical sorting can be very useful when needed, but be aware that unnecessarily specifying lexical ordering may degrade performance of QSORT. --- Examples --- Produce a sorted directory listing and display it on the console a screen's worth at a time: A>DIR | QSORT | MORE This demonstrates the use of QSORT as a "filter" in a "pipe." Produce a directory listing sorted by creation date and time, and display it on the console a screen's worth at a time: A>DIR | QSORT /30:2 /24:5 /39 /34:5 | MORE The output of the DIR command is piped to QSORT. The key fields defined are, from left to right (major to minor), year (2 digits), month and day, AM/PM flag and time. The output of QSORT is then piped to MORE for display. Next, replace the unsorted FILE.TXT with the same data sorted in reverse order. Use columns 10 to 16 as the sort key: A>QSORT FILE.TXT /-10:7 or A>QSORT FILE.TXT /10:7 /R or (SORT compatible) A>QSORT FILE.TXT /R /+10 Next, perform a simple sort on a file with up to 240-byte records A>QSORT LARGE.REC /M240 or A>QSORT LARGE.REC Note that the "/M240" parameter is no longer needed, but will not hurt. GLOSS.TXT is an unsorted glossary of terms. The term being defined by each entry appears first, followed by several lines of definition. The entries are separated by empty lines. Produce GLOSS.SRT, a sorted version of the glossary: A>QSORT /T <GLOSS.TXT >GLOSS.SRT (with redirection) or A>QSORT /T GLOSS.TXT GLOSS.SRT (without redirection) A lawyer keeps a running log of his billable activities in TIME.LOG. The first line of each entry is "mm/dd/yy hh:mm account#." He always places a tilde (~) in the last line of each entry. He wishes to sort the log by account number, and by ascending date and time within each account: QSORT /16:7 /7:2 /1:5 /10:5 /T~ TIME.LOG The directory of users for a bulletin board system is kept in a binary file of fixed-length records 180 bytes long. The user name is a 26-character field beginning in the first position and the city/state field is a 16-character field beginning in the fortieth position. Sort the file by city/state and name. QSORT /F180 /40:16 /1:26 USER.BBS --- Implementation Notes --- QSORT is intended as an enhanced replacement for DOS SORT. It is nearly fully upward compatible, but provides much more flexibility. Multiple sort keys may be specified, a pseudo in-place sort may be performed and files and/or records of any size may be sorted provided only that there is sufficient disk space for work files and the output file. QSORT uses the "quick sort" algorithm, which cannot guarantee the order of records whose keys are all equal. This is the one "incompatibility" with DOS SORT, which retains the original order of records when its only key compares equal. This is important to SORT because it must be invoked multiple times to effect a multiple key sort. With QSORT, you only sort once and there are usually enough keys available to insure you get the order you want the first time. QSORT uses a sort buffer of about 50K bytes and will fill the buffer as full as possible, and then sort the contents of the buffer. If it has reached the end of the input file and has not yet generated any work files, the contents of the buffer are written to the output file, completing the sort operation. If the input file is too large to fit into the sort buffer, as much of the input file as possible is read into the buffer, sorted, then written to a temporary work file. This process is repeated as many times as necessary to process the entire input file, each time creating a new work file for the sorted output. Upon completion of the "sort phase," QSORT begins a "merge phase." Each work file is a sorted sub-set of the input file. Thus, work files may be read sequentially and combined to produce a sorted output. QSORT will open as many work files as DOS permits (more on this later). If all the remaining work files can be opened, the sorted result is written to the output file. Otherwise, a new work file is created and another merge pass will be required. On each merge pass, the number of work files is reduced and eventually all remaining work files will be opened and the sorted output file will be written completing the sort operation. QSORT is smart enough to never have just one work file remaining, which would require an unnecessary copy operation. In fact, QSORT is smarter than just that in its handling of the merge phase. If more than one merge pass is required, all the data merged during the first pass will have to be merged again, so QSORT attempts to minimize the first pass. For example, if QSORT discovers it may only open 15 files at a time, and there are 16 temporary files, it will only merge two files on the first pass, creating a 17th file as it does. Then in the second pass, it will merge all 15 remaining files to the output file. The less data it processes twice, the faster it performs the sort! With nothing else to guide it, QSORT places its temporary files in the default directory. Either of two "environment variables" can override this. (See your DOS manual for information on environment variables and the SET command.) The command: SET QSTMP=<path> or SET TMP=<path> will define a path for QSORT to use for its temporaries. QSORT first looks for the environment variable QSTMP. If it does not exist, QSORT next looks for TMP. TMP is a de facto standard used by many programs, and is usually defined in your AUTOEXEC.BAT batch file. You might have TMP specifying a 64K RAM disk to speed up your compiler. In this case, an attempt to sort a 100K file is doomed to failure. Rather than redefine TMP, you may define QSTMP to force QSORT to use some directory on your hard disk. In fact: SET QSTMP=. tells QSORT to always use the default directory anyway! QSORT, to work properly, needs enough space on the output disk to hold the output file. Even if the input file is to be deleted and resides in the same directory, that is not done until after the output file has been successfully written. If one merge pass is required, the disk QSORT uses for temporaries should have about 10% more space than the size of the input file for temporary merge files. If more than one merge pass will be required, allow about twice the size of the input file as temporary merge file space. One of the advantages of controlling where QSORT places its temporary files is to insure adequate space for them. A second is speed. If the temporary files can be placed on a separate disk from the input and output files, disk seeking is minimized and performance improved. Each time QSORT must create a new temporary merge file, the data put into it will be processed again. Obviously, the more files QSORT can open during the merge phase, the fewer times it will have to handle each record and the faster it can sort large files. If DOS is properly pre-conditioned, QSORT can have up to 15 temporary merge files open at once, and very large files can be sorted with just one sort pass and one merge pass. Unfortunately, that capability is not automatic. DOS has a fixed number of file "handles" that it associates with open files. The default number is eight, but DOS opens five of them for standard input, standard output, standard error, standard printer and standard auxiliary device. That leaves three for merging. A 250K input file would produce five temporary merge files and that would take three merge passes; merge two into one, leaving four; merge two into one leaving three; and finally merge three into the output file. In the process, QSORT must read and write about 80% of the file twice during the merge phase. Worse yet, since you need at least three handles for merging, if you have resident programs that have open files, you can't merge at all! DOS can be told to set aside more space for file handles. Each handle is only 39 bytes and it's memory very well spent. One process can have a maximum of 20 handles open at one time, but since resident processes may be using handles, I recommend 25 to 35. To do this, the root directory of the DISK OR DISKS YOU BOOT FROM must contain a file named CONFIG.SYS. If your boot disk(s) already contains a CONFIG.SYS, edit it, or if not, create it to contain the following line: FILES=25 (or more) While we're at it, let's add one more thing to CONFIG.SYS which will improve the performance of QSORT and many other programs as well. DOS provides, by default, two disk buffers. These are the buffers it uses to do its disk reads and writes. During the merge phase QSORT may have many files open at once, reading from them in more or less random order. DOS may have to read the same physical sector several times to get all its data. But DOS can remember what's in each buffer and where it came from, and will not re-read a sector it already has in a buffer. DOS needs 528 bytes for each buffer. I recommend 20 buffers to make QSORT perform well under the most adverse conditions. This will require an additional 9504 bytes or slightly more than 9K, again memory well spent, so we add to CONFIG.SYS the following line: BUFFERS=20 See your DOS manual for more information on CONFIG.SYS. I hope you find this program useful. Your comments and suggestions are welcome. My address is at the top of this file.